INTERSPEECH.2018 - Analysis and Assessment

Total: 67

#1 Vowel Space as a Tool to Evaluate Articulation Problems [PDF] [Copy] [Kimi]

Authors: Rob van Son ; Catherine Middag ; Kris Demuynck

Treatment for oral tumors can lead to long-term changes in the anatomy and physiology of the vocal tract and result in problems with articulation. There are currently no readily available automatic methods to evaluate changes in articulation. We developed a Praat script which plots and measures vowel space coverage. The script reproduces speaker-specific vowel space use and speaking-style-dependent vowel reduction in normal speech from a Dutch corpus. Speaker identity and speaking style explain more than 60% of the variance in the measured area of the vowel triangle. In recordings of patients treated for oral tumors, vowel space use before and after treatment remains significantly correlated. Articulation before and after treatment is evaluated in a listening experiment and with a maximal articulation speed task. Linear models can explain 50-75% of the variance in perceptual ratings and relative articulation rate from values at previous recordings and vowel space measures.
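
As an illustration of the core measure, here is a minimal Python sketch of a vowel triangle area computed from corner-vowel formants with the shoelace formula; the formant values and function name are placeholders, not the authors' Praat script.

```python
import numpy as np

def vowel_triangle_area(corner_formants):
    """Area of the vowel triangle spanned by the corner vowels /a/, /i/, /u/.

    corner_formants: dict mapping vowel label to (F1, F2) in Hz.
    Uses the shoelace formula on the three (F1, F2) points."""
    pts = np.array([corner_formants[v] for v in ("a", "i", "u")], dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(x[0] * (y[1] - y[2]) + x[1] * (y[2] - y[0]) + x[2] * (y[0] - y[1]))

# Example with hypothetical mean corner-vowel formants for one speaker.
print(vowel_triangle_area({"a": (750, 1300), "i": (300, 2300), "u": (320, 800)}))
```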

#2 Towards a Better Characterization of Parkinsonian Speech: A Multidimensional Acoustic Study [PDF] [Copy] [Kimi1]

Authors: Veronique Delvaux ; Kathy Huet ; Myriam Piccaluga ; Sophie van Malderen ; Bernard Harmegnies

This paper reports on a first attempt at adopting a new perspective in characterizing speech disorders in Parkinson's Disease (PD), based on individual patient profiles. Acoustic data were collected from 13 Belgian French speakers with PD (6 male, 7 female, aged 45-81) and 50 healthy controls (HC) using the "MonPaGe" protocol (Fougeron et al., LREC18). In this protocol, various kinds of linguistic material are recorded in different speech conditions in order to assess multiple speech dimensions for each speaker. First, we compared a variety of voice and speech parameters across groups (HC vs. PD patients). Second, we examined individual profiles of PD patients. Results showed that, as a group, PD participants most systematically differed from HC in terms of speech tempo and rhythm. Moreover, the analysis of individual profiles revealed that other parameters, related to pneumophonatory control and linguistic prosody, were valuable for describing the speech specificities of several PD patients.

#3 Self-similarity Matrix Based Intelligibility Assessment of Cleft Lip and Palate Speech [PDF] [Copy] [Kimi1]

Authors: Sishir Kalita ; S R Mahadeva Prasanna ; Samarendra Dandapat

This work presents a comparison-based framework that exploits self-similarity matrix matching to estimate the speech intelligibility of children with cleft lip and palate (CLP). The self-similarity matrix (SSM) of a feature sequence is a square matrix which encodes the acoustic-phonetic composition of the underlying speech signal. Deviations in the acoustic characteristics of the underlying sound units caused by degraded intelligibility make the SSM structure of CLP speech deviate from that of normal speech. The degree of deviation between the SSM of CLP speech and the SSM of the corresponding normal speech may therefore carry information about the severity of the intelligibility deficit. This degree of deviation is quantified with the structural similarity (SSIM) index, which serves as an objective intelligibility score. The proposed method is evaluated using two parameterizations of the speech signal, Mel-frequency cepstral coefficients and Gaussian posteriorgrams, and compared with a dynamic time warping (DTW) based intelligibility assessment method. The proposed SSM-based method shows a better correlation with perceptual ratings of intelligibility than the DTW-based method.
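
A minimal Python sketch of the SSM-plus-SSIM idea, assuming cosine similarity for the SSM and a simple resize to a common matrix size; feature extraction, the resize target and all names are placeholder choices rather than the paper's exact configuration.

```python
import numpy as np
from skimage.transform import resize
from skimage.metrics import structural_similarity

def self_similarity_matrix(features):
    """Cosine self-similarity matrix of a (n_frames, n_dims) feature sequence,
    e.g. MFCCs or Gaussian posteriorgrams."""
    normed = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    return normed @ normed.T

def ssm_intelligibility_score(feats_test, feats_reference, size=200):
    """SSIM between the SSMs of a test (CLP) utterance and a normal reference
    utterance of the same text; higher values suggest higher intelligibility."""
    a = resize(self_similarity_matrix(feats_test), (size, size), anti_aliasing=True)
    b = resize(self_similarity_matrix(feats_reference), (size, size), anti_aliasing=True)
    return structural_similarity(a, b, data_range=2.0)  # cosine values lie in [-1, 1]
```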

#4 Pitch-Adaptive Front-end Feature for Hypernasality Detection [PDF] [Copy] [Kimi1]

Authors: Akhilesh Kumar Dubey ; S R Mahadeva Prasanna ; S Dandapat

Hypernasality in children with cleft palate (CP) is due to velopharyngeal insufficiency. Vowels get nasalized in hypernasal speech and the nasality evidence is mainly present in the low-frequency region around the first formant (F1). Hypernasality detection using the Mel-frequency cepstral coefficient (MFCC) feature may suffer because the feature might not capture the nasality evidence in this low-frequency region: MFCCs extracted from high-pitched children's speech are affected by the pitch harmonics in the magnitude spectrum. This pitch-harmonics effect results in high variance in the higher-order MFCC coefficients, and the problem is aggravated by the high pitch perturbation in CP speech. In this work, a pitch-adaptive MFCC feature is therefore used for hypernasality detection. The feature is derived from a cepstrally smoothed spectrum instead of the magnitude spectrum, with pitch-adaptive low-time liftering used to smooth out the pitch harmonics. When used for hypernasality detection with a support vector machine (SVM), this feature gives accuracies of 83.45%, 88.04% and 85.58% for the vowels /a/, /i/ and /u/, respectively, which is better than the accuracy of the MFCC feature.
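
A sketch of pitch-adaptive low-time liftering in Python, assuming the lifter cutoff is placed slightly below one pitch period; the 0.9 factor, FFT size and function name are assumptions, not the paper's parameters.

```python
import numpy as np

def liftered_log_spectrum(frame, fs, f0, n_fft=1024):
    """Smooth pitch harmonics out of a frame's log-magnitude spectrum.

    The cutoff quefrency is set below the pitch period (1/f0) so that the
    retained cepstral coefficients describe the vocal-tract envelope rather
    than the excitation harmonics; f0 can come from any pitch tracker."""
    spec = np.fft.rfft(frame, n_fft)
    log_mag = np.log(np.abs(spec) + 1e-10)
    cepstrum = np.fft.irfft(log_mag, n_fft)
    cutoff = int(0.9 * fs / f0)              # keep quefrencies below ~0.9 pitch periods
    lifter = np.zeros(n_fft)
    lifter[:cutoff] = 1.0
    lifter[-cutoff + 1:] = 1.0               # keep the symmetric (negative-quefrency) part
    smooth_log_mag = np.fft.rfft(cepstrum * lifter, n_fft).real
    return smooth_log_mag                    # feed to a mel filterbank + DCT for MFCC-like features
```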

#5 Detection of Amyotrophic Lateral Sclerosis (ALS) via Acoustic Analysis [PDF] [Copy] [Kimi1]

Authors: Raquel Norel ; Mary Pietrowicz ; Carla Agurto ; Shay Rishoni ; Guillermo Cecchi

ALS is a fatal neurodegenerative disease with no cure. Experts typically measure disease progression via the ALSFRS-R score, which includes measurements of various abilities known to decline. We propose instead the use of speech analysis as a proxy for ALS progression. This technique enables 1) frequent, non-invasive, inexpensive, longitudinal analysis; 2) analysis of data recorded in the wild; and 3) creation of an extensive ALS databank for future analysis. Patients and trained medical professionals need not be co-located, enabling more frequent monitoring of more patients from the convenience of their own homes. The goals of this study are the identification of acoustic speech features in naturalistic contexts that characterize disease progression and the development of machine learning models that can recognize the presence and severity of the disease. We evaluated subjects from the Prize4Life Israel dataset, using a variety of frequency, spectral and voice quality features. The dataset was generated using the ALS Mobile Analyzer, a cell-phone app that collects data on disease progress using a self-reported ALSFRS-R questionnaire and several active tasks that measure speech and motor skills. Classification via leave-five-subjects-out cross-validation resulted in an accuracy of 79% (61% chance) for males and 83% (52% chance) for females.
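 
One way to implement speaker-independent leave-five-subjects-out cross-validation is to hold out five whole subject groups per fold, as in this hedged scikit-learn sketch; the data shapes, classifier and number of splits are placeholders, not the study's setup.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: acoustic features per recording, y: ALS vs. control label,
# subjects: one subject id per recording (all synthetic placeholders here).
X = np.random.randn(200, 40)
y = np.random.randint(0, 2, 200)
subjects = np.repeat(np.arange(40), 5)

# Each fold holds out five complete subjects, so no speaker appears in both
# the training and the test partition.
cv = GroupShuffleSplit(n_splits=20, test_size=5, random_state=0)
accs = []
for train_idx, test_idx in cv.split(X, y, groups=subjects):
    clf = make_pipeline(StandardScaler(), SVC())
    clf.fit(X[train_idx], y[train_idx])
    accs.append(clf.score(X[test_idx], y[test_idx]))
print(f"mean accuracy over folds: {np.mean(accs):.3f}")
```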

#6 Detection of Glottal Activity Errors in Production of Stop Consonants in Children with Cleft Lip and Palate [PDF] [Copy] [Kimi1]

Authors: C M Vikram ; S R Mahadeva Prasanna ; Ajish K Abraham ; Pushpavathi M ; Girish K S

Individuals with cleft lip and palate (CLP) alter the glottal activity characteristics during the production of stop consonants. The presence or absence of glottal vibrations during the production of unvoiced or voiced stops, respectively, is referred to as a glottal activity error (GAE). In this work, acoustic-phonetic and production-based knowledge of stop consonants is exploited to propose an algorithm for the automatic detection of GAEs. The algorithm uses zero-frequency filtered and band-pass (500-4000 Hz) filtered speech signals to identify the syllable nuclei positions, followed by the detection of the glottal activity characteristics of the consonant present within each syllable. Based on the identified glottal activity characteristics of the consonant and the a priori voicing information of the target stop consonant, the presence or absence of a GAE is detected. The algorithm is evaluated on a database containing the responses of normal children and children with repaired CLP to target consonant-vowel-consonant-vowel words with stop consonants.
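
For context, the zero-frequency filtering step can be sketched as below (a standard formulation, not necessarily the authors' exact parameters): the signal passes twice through a zero-frequency resonator and the slowly growing trend is removed by local mean subtraction, after which regions with strong, periodic zero crossings indicate glottal activity.

```python
import numpy as np
from scipy.signal import lfilter

def zero_frequency_filtered(x, fs, avg_pitch_period_s=0.008):
    """Zero-frequency filtering sketch for locating glottal activity."""
    s = np.diff(x, prepend=x[0])                  # remove DC / low-frequency drift
    y = lfilter([1.0], [1.0, -2.0, 1.0], s)       # first pass through the resonator
    y = lfilter([1.0], [1.0, -2.0, 1.0], y)       # second pass
    n = int(1.5 * avg_pitch_period_s * fs)        # window of ~1.5 average pitch periods
    kernel = np.ones(2 * n + 1) / (2 * n + 1)
    for _ in range(3):                            # trend removal, applied a few times
        y = y - np.convolve(y, kernel, mode="same")
    return y
```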

#7 Joint Learning of Domain Classification and Out-of-Domain Detection with Dynamic Class Weighting for Satisficing False Acceptance Rates [PDF] [Copy] [Kimi1]

Authors: Joo-Kyung Kim ; Young-Bum Kim

In domain classification for spoken dialog systems, correct detection of out-of-domain (OOD) utterances is crucial because it reduces confusion and unnecessary interaction costs between users and the systems. Previous work usually utilizes OOD detectors that are trained separately from in-domain (IND) classifiers, with confidence thresholding for OOD detection given target evaluation scores. In this paper, we introduce a neural joint learning model for domain classification and OOD detection, where dynamic class weighting is used during model training to satisfice a given OOD false acceptance rate (FAR) while maximizing the domain classification accuracy. Evaluating on two domain classification tasks for utterances from a large spoken dialogue system, we show that our approach significantly improves the domain classification performance while satisficing the given target FARs.
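
A hedged PyTorch sketch of the general idea of dynamic class weighting toward a FAR target; the weight-update schedule, multiplier and all names are invented for illustration and are not the paper's algorithm.

```python
import torch
import torch.nn as nn

def train_with_dynamic_ood_weight(model, train_loader, dev_loader,
                                  n_ind, target_far=0.05, epochs=10, lr=1e-3):
    """Joint IND+OOD classifier training with an OOD class weight that is
    adjusted per epoch until the dev-set false acceptance rate is satisficed."""
    ood = n_ind                                   # OOD is the extra class index
    weights = torch.ones(n_ind + 1)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        model.train()
        criterion = nn.CrossEntropyLoss(weight=weights)
        for x, y in train_loader:
            opt.zero_grad()
            criterion(model(x), y).backward()
            opt.step()
        # FAR on the dev set: fraction of OOD utterances accepted as in-domain.
        model.eval()
        accepted, total_ood = 0, 0
        with torch.no_grad():
            for x, y in dev_loader:
                pred = model(x).argmax(dim=1)
                mask = y == ood
                total_ood += int(mask.sum())
                accepted += int((pred[mask] != ood).sum())
        far = accepted / max(total_ood, 1)
        if far > target_far:                      # upweight OOD until satisficed,
            weights[ood] *= 1.2
        else:                                     # otherwise relax toward uniform weights
            weights[ood] = max(1.0, weights[ood] / 1.2)
    return model
```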

#8 Analyzing Vocal Tract Movements During Speech Accommodation [PDF] [Copy] [Kimi1]

Authors: Sankar Mukherjee ; Thierry Legou ; Leonardo Lancia ; Pauline Hilt ; Alice Tomassini ; Luciano Fadiga ; Alessandro D'Ausilio ; Leonardo Badino ; Noël Nguyen

When two people engage in verbal interaction, they tend to accommodate on a variety of linguistic levels. Although recent attention has focused on the acoustic characteristics of convergence in speech, the underlying articulatory mechanisms remain to be explored. Using 3D electromagnetic articulography (EMA), we simultaneously recorded articulatory movements in two speakers engaged in an interactive verbal game, the domino task. In this task, the two speakers take turns chaining bi-syllabic words according to a rhyming rule. Using a robust speaker identification strategy, we identified the specific words for which speakers converged or diverged. We then explored the vocal tract features that characterize speech accommodation. Our results suggest that tongue movements tend to slow down during convergence, whereas maximal jaw opening during convergence and divergence differs depending on syllable position.

#9 Cross-Lingual Multi-Task Neural Architecture for Spoken Language Understanding [PDF] [Copy] [Kimi1]

Authors: Yujiang Li ; Xuemin Zhao ; Weiqun Xu ; Yonghong Yan

Cross-lingual spoken language understanding (SLU) systems traditionally require machine translation services for language portability and liberation from human supervision. However, parallel corpora and model architectures impose restrictions. Assuming reliable human-supervised data are available, which encourages the use of non-parallel corpora and alleviates translation errors, this paper explores cross-lingual knowledge transfer at multiple levels by taking advantage of neural architectures. We first investigate a joint model of slot filling and intent determination for SLU which alleviates the out-of-vocabulary problem and explicitly models dependencies between output labels by combining character and word representations, bidirectional Long Short-Term Memory and conditional random fields, while an attention-based classifier is introduced for intent determination. Knowledge transfer is then carried out at the character level and the sequence level: character representations are shared to exploit morphological and phonological similarities between languages with similar alphabets, and the sequence is characterized with language-general and language-specific knowledge adaptively acquired by separate encoders. Experimental results on the MIT Restaurant Corpus and the ATIS corpus in different languages demonstrate the effectiveness of the proposed methods.

#10 Statistical Model Compression for Small-Footprint Natural Language Understanding [PDF] [Copy] [Kimi1]

Authors: Grant P. Strimel ; Kanthashree Mysore Sathyendra ; Stanislav Peshterliev

In this paper we investigate statistical model compression applied to natural language understanding (NLU) models. Small-footprint NLU models are important for enabling offline systems on hardware-restricted devices and for decreasing on-demand model loading latency in cloud-based systems. To compress NLU models, we present two main techniques: parameter quantization and perfect feature hashing. These techniques are complementary to existing model pruning strategies such as L1 regularization. We performed experiments on a large-scale NLU system. The results show that our approach achieves a 14-fold reduction in memory usage compared to the original models with minimal impact on predictive performance.
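
A minimal sketch of the two ingredients in generic form: ordinary (collision-prone) feature hashing via scikit-learn stands in for the paper's perfect (collision-free) hashing, and linear 8-bit weight quantization illustrates parameter quantization; feature strings and sizes are placeholders.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression

# Feature hashing: map a potentially huge, sparse NLU feature space
# (n-grams, gazetteer hits, ...) into a fixed number of buckets.
hasher = FeatureHasher(n_features=2 ** 16, input_type="string")
X_train = hasher.transform([["w=play", "w=music", "prev=please"],
                            ["w=stop", "prev=now"]])
clf = LogisticRegression().fit(X_train, [1, 0])

# Linear 8-bit parameter quantization of the trained weights.
w = clf.coef_.ravel()
scale = np.abs(w).max() / 127.0
w_q = np.round(w / scale).astype(np.int8)       # stored model: int8 weights + one float scale
w_dequant = w_q.astype(np.float32) * scale      # dequantized weights used at inference time
```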

#11 Comparison of an End-to-end Trainable Dialogue System with a Modular Statistical Dialogue System [PDF] [Copy] [Kimi1]

Authors: Norbert Braunschweiler ; Alexandros Papangelis

This paper presents a comparison of two dialogue systems: one is end-to-end trainable and the other uses a more traditional, modular architecture. End-to-end trainable dialogue systems have recently attracted a lot of attention because they offer several advantages over traditional systems. One of them is that there is no need to train each system module independently: a single network architecture maps an input to the corresponding output without intermediate representations. While the end-to-end system investigated here had been tested in a text-in/out scenario, it remained an open question how it would perform in a speech-in/out scenario, with noisy input from a speech recognizer and output speech generated by a speech synthesizer. To evaluate this, both dialogue systems were trained on the same corpus, consisting of human-human dialogues in the Cambridge restaurant domain, and then compared in both scenarios by human evaluation. The results show that in both scenarios the end-to-end system receives significantly higher ratings on all metrics than the traditional modular system, an indication that it enables users to reach their goals faster and to experience both more natural system responses and better comprehension by the dialogue system.

#12 A Discriminative Acoustic-Prosodic Approach for Measuring Local Entrainment [PDF] [Copy] [Kimi1]

Authors: Megan Willi ; Stephanie A. Borrie ; Tyson S. Barrett ; Ming Tu ; Visar Berisha

Acoustic-prosodic entrainment describes the tendency of humans to align or adapt their speech acoustics to each other in conversation. This alignment of spoken behavior has important implications for conversational success. However, modeling the subtle nature of entrainment in spoken dialogue continues to pose a challenge. In this paper, we propose a straightforward definition of local entrainment in the speech domain and operationalize an algorithm based on it: acoustic-prosodic features that capture entrainment should be maximally different between real conversations involving two partners and sham conversations generated by randomly mixing the speaking turns of the original two conversational partners. We propose an approach for measuring local entrainment that quantifies alignment of behavior on a turn-by-turn basis, projecting the differences between the interlocutors' acoustic-prosodic features for a given turn onto a discriminative feature subspace that maximizes the difference between real and sham conversations. We evaluate the method by using the derived features to drive a classifier that predicts an objective measure of conversational success (low versus high) on a corpus of task-oriented conversations. The proposed entrainment approach achieves 72% classification accuracy using a Naive Bayes classifier, outperforming three previously established approaches evaluated on the same conversational corpus.
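
A hedged sketch of the pipeline's shape, with linear discriminant analysis standing in for "a discriminative feature subspace" and synthetic arrays standing in for real turn-level feature differences; none of this reproduces the paper's exact projection.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# diff_real / diff_sham: per-turn absolute differences between the two partners'
# acoustic-prosodic feature vectors, from real and turn-shuffled (sham)
# conversations respectively (synthetic placeholders of shape (n_turns, n_feats)).
diff_real = np.random.randn(500, 12)
diff_sham = np.random.randn(500, 12) + 0.5

# Learn the axis that maximally separates real from sham turn differences.
lda = LinearDiscriminantAnalysis(n_components=1)
lda.fit(np.vstack([diff_real, diff_sham]),
        np.concatenate([np.ones(len(diff_real)), np.zeros(len(diff_sham))]))

# Project a conversation's turn-level differences onto that axis; aggregated
# projections (e.g. per-conversation means) can then drive a Naive Bayes
# classifier of conversational success.
entrainment_scores = lda.transform(diff_real).ravel()
print(entrainment_scores.mean())
```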

#13 Investigating Speech Features for Continuous Turn-Taking Prediction Using LSTMs [PDF] [Copy] [Kimi1]

Authors: Matthew Roddy ; Gabriel Skantze ; Naomi Harte

For spoken dialog systems to conduct fluid conversational interactions with users, the systems must be sensitive to turn-taking cues produced by a user. Models should be designed so that effective decisions can be made as to when it is appropriate, or not, for the system to speak. Traditional end-of-turn models, where decisions are made at utterance end-points, are limited in their ability to model fast turn-switches and overlap. A more flexible approach is to model turn-taking in a continuous manner using RNNs, where the system predicts speech probability scores for discrete frames within a future window. The continuous predictions represent generalized turn-taking behaviors observed in the training data and can be applied to make decisions that are not just limited to end-of-turn detection. In this paper, we investigate optimal speech-related feature sets for making predictions at pauses and overlaps in conversation. We find that while traditional acoustic features perform well, part-of-speech features generally perform worse than word features. We show that our current models outperform previously reported baselines.
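
A minimal PyTorch sketch of a continuous turn-taking predictor of this kind: at every input frame an LSTM emits speech-probability logits for each frame of a future window. Feature dimensionality, window length and hidden size are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ContinuousTurnTakingLSTM(nn.Module):
    """Per-frame prediction of speech activity over a future window."""
    def __init__(self, n_features=130, hidden=64, future_frames=60):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, future_frames)

    def forward(self, x):              # x: (batch, time, n_features)
        h, _ = self.lstm(x)
        return self.out(h)             # logits: (batch, time, future_frames)

model = ContinuousTurnTakingLSTM()
x = torch.randn(8, 200, 130)           # e.g. acoustic + linguistic features per frame
targets = torch.randint(0, 2, (8, 200, 60)).float()
loss = nn.BCEWithLogitsLoss()(model(x), targets)
loss.backward()
```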

#14 Classification of Correction Turns in Multilingual Dialogue Corpus [PDF] [Copy] [Kimi1]

Authors: Ivan Kraljevski ; Diane Hirschfeld

This paper presents a multiclass classification of correction dialog turns using machine learning. The classes are determined by the type of recognition errors introduced during the Wizard-of-Oz (WOz) trials used to create the multilingual corpus. Three datasets were obtained using different sets of acoustic-prosodic features from the multilingual dialogue corpus. The classification experiments were carried out with different machine learning paradigms: Decision Trees, Support Vector Machines and Deep Learning. After careful experiment setup and optimization over the hyper-parameter space, the obtained classification results were analyzed and compared in terms of accuracy, precision, recall and F1 score. The achieved results are comparable with those obtained in similar experiments on different tasks and speech databases.

#15 Contextual Slot Carryover for Disparate Schemas [PDF] [Copy] [Kimi1]

Authors: Chetan Naik ; Arpit Gupta ; Hancheng Ge ; Mathias Lambert ; Ruhi Sarikaya

In the slot-filling paradigm, where a user can refer back to slots in the context during the conversation, the goal of the contextual understanding system is to resolve the referring expressions to the appropriate slots in the context. In large-scale multi-domain systems, this presents two challenges: scaling to a very large and potentially unbounded set of slot values, and dealing with diverse schemas. We present a neural network architecture that addresses the slot value scalability challenge by reformulating contextual interpretation as a decision to carry over a slot from a set of possible candidates. To deal with heterogeneous schemas, we introduce a simple data-driven method for transforming the candidate slots. Our experiments show that our approach can scale to multiple domains and provides competitive results over a strong baseline.

#16 Capsule Networks for Low Resource Spoken Language Understanding [PDF] [Copy] [Kimi1]

Authors: Vincent Renkens ; Hugo van Hamme

Designing a spoken language understanding system for command-and-control applications can be challenging because of a wide variety of domains and users or because of a lack of training data. In this paper we discuss a system that learns from scratch from user demonstrations. This method has the advantage that the same system can be used for many domains and users without modification and that no training data is required prior to deployment. The user is required to train the system, so for a user-friendly experience it is crucial to minimize the required amount of data. In this paper we investigate whether a capsule network can make efficient use of the limited amount of available training data. We compare the proposed model to an approach based on Non-negative Matrix Factorisation, which is the state of the art in this setting, and to another deep learning approach that was recently introduced for end-to-end spoken language understanding. We show that the proposed model outperforms the baseline models on three command-and-control applications: controlling a small robot, a vocally guided card game and a home automation task.

#17 Intent Discovery Through Unsupervised Semantic Text Clustering [PDF] [Copy] [Kimi1]

Authors: A Padmasundari ; Srinivas Bangalore

Conversational systems need to understand spoken language to be able to converse with a human in a meaningful, coherent manner. This understanding (Spoken Language Understanding, SLU) of human language is operationalized through identifying intents and entities. While classification methods that rely on labeled data are often used for SLU, creating large supervised data sets is extremely tedious and time consuming. This paper presents a practical approach to automating intent discovery on unlabeled data sets of human language text through clustering techniques. We explore a range of text representations and various clustering methods, and validate clustering stability through quantitative metrics such as the Adjusted Rand Index (ARI). A final alignment of the clusters to the semantic intents is determined through consensus labelling. Our experiments on public datasets demonstrate the effectiveness of our approach, generating homogeneous clusters with 89% cluster accuracy and leading to better semantic intent alignments. Furthermore, we illustrate that the clustering offers an alternative and effective way to mine sentence variants that can aid the bootstrapping of SLU models.
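
A toy scikit-learn sketch of the general recipe (represent, cluster, score with ARI); the TF-IDF representation, k-means, utterances and reference labels are illustrative stand-ins, not the paper's setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Toy unlabeled utterances; in practice the representation could also come
# from word embeddings or sentence encoders rather than TF-IDF.
utterances = ["book a table for two", "reserve a table tonight",
              "what's the weather tomorrow", "is it going to rain today"]
X = TfidfVectorizer().fit_transform(utterances)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Clustering quality against a small set of reference labels (hypothetical),
# measured with the Adjusted Rand Index.
reference = [0, 0, 1, 1]
print(adjusted_rand_score(reference, labels))
```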

#18 Multimodal Polynomial Fusion for Detecting Driver Distraction [PDF] [Copy] [Kimi1]

Authors: Yulun Du ; Alan W Black ; Louis-Philippe Morency ; Maxine Eskenazi

Distracted driving is deadly, claiming 3,477 lives in the U.S. in 2015 alone. Although there has been a considerable amount of research on modeling the distracted behavior of drivers under various conditions, accurate automatic detection using multiple modalities and especially the contribution of using the speech modality to improve accuracy has received little attention. This paper introduces a new multimodal dataset for distracted driving behavior and discusses automatic distraction detection using features from three modalities: facial expression, speech and car signals. Detailed multimodal feature analysis shows that adding more modalities monotonically increases the predictive accuracy of the model. Finally, a simple and effective multimodal fusion technique using a polynomial fusion layer shows superior distraction detection results compared to the baseline SVM and neural network models.
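
For illustration, a hedged PyTorch sketch of a second-order polynomial fusion of three unimodal embeddings (concatenation plus pairwise element-wise products feeding a classifier); the dimensions and the exact interaction form are assumptions, not the paper's layer.

```python
import torch
import torch.nn as nn

class PolynomialFusion(nn.Module):
    """Fuse face, speech and car-signal embeddings with second-order terms."""
    def __init__(self, dim=32, n_classes=2):
        super().__init__()
        self.cls = nn.Linear(3 * dim + 3 * dim, n_classes)   # linear + pairwise terms

    def forward(self, face, speech, car):
        pairwise = torch.cat([face * speech, face * car, speech * car], dim=-1)
        fused = torch.cat([face, speech, car, pairwise], dim=-1)
        return self.cls(fused)

model = PolynomialFusion()
logits = model(torch.randn(4, 32), torch.randn(4, 32), torch.randn(4, 32))
```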

#19 Engagement Recognition in Spoken Dialogue via Neural Network by Aggregating Different Annotators' Models [PDF] [Copy] [Kimi1]

Authors: Koji Inoue ; Divesh Lala ; Katsuya Takanashi ; Tatsuya Kawahara

This paper addresses engagement recognition based on four multimodal listener behaviors - backchannels, laughing, eye-gaze and head nodding. Engagement is an indicator of how much a user is interested in the current dialogue. Multiple third-party annotators give ground truth labels of engagement in a human-robot interaction corpus. Since perception of engagement is subjective, the annotations are sometimes different between individual annotators. Conventional methods directly use integrated labels, such as those generated through simple majority voting and do not consider each annotator's recognition. We propose a two-step engagement recognition where each annotator's recognition is modeled and the different annotators' models are aggregated to recognize the integrated label. The proposed neural network consists of two parts. The first part corresponds to each annotator's model which is trained with the corresponding labels independently. The second part aggregates the different annotators' models to obtain one integrated label. After each part is pre-trained, the whole network is fine-tuned through back-propagation of prediction errors. Experimental results show that the proposed network outperforms baseline models which directly recognize the integrated label without considering differing annotations.
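
A minimal PyTorch sketch of the two-part structure described above, with one output head per annotator and an aggregation layer on top; sizes and the aggregation form are placeholders rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class AnnotatorAggregationNet(nn.Module):
    """Per-annotator heads whose predictions are aggregated into one label."""
    def __init__(self, n_features=4, hidden=16, n_annotators=5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.annotator_heads = nn.ModuleList(
            [nn.Linear(hidden, 1) for _ in range(n_annotators)])
        self.aggregate = nn.Linear(n_annotators, 1)

    def forward(self, x):                       # x: listener-behavior features per segment
        h = self.encoder(x)
        per_annotator = torch.cat([head(h) for head in self.annotator_heads], dim=-1)
        return self.aggregate(torch.sigmoid(per_annotator))   # logit of integrated label

# Pre-train each head on its annotator's labels, then fine-tune the whole
# network end-to-end on the integrated labels, as described in the abstract.
```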

#20 A First Investigation of the Timing of Turn-taking in Ruuli [PDF] [Copy] [Kimi1]

Authors: Tuarik Buanzur ; Margaret Zellers ; Saudah Namyalo ; Alena Witzlack-Makarevich

Turn-taking behavior in conversation is reported to be universal among cultures, although the language-specific means used to accomplish smooth turn-taking are likely to differ. Previous studies investigating turn-taking have primarily focused on languages which are already heavily-studied. The current work investigates the timing of turn-taking in question-response sequences in naturalistic conversations in Ruuli, an under-studied Bantu language spoken in Uganda. We extracted sequences involving wh-questions and polar questions and measured the duration of the gap or overlap between questions and their following responses, additionally differentiating between different response types such as affirmative (i.e. type-conforming) or negative (i.e. non-type-conforming) responses to polar questions. We find that the timing of responses to various question types in Ruuli is consistent with timings that have been reported for a variety of other languages, with a mean gap duration between questions and responses of around 259 ms. Our findings thus emphasize the universal nature of turn-taking behavior in human interaction, despite Ruuli's substantial structural differences from languages in which turn-taking has been previously studied.

#21 On the Usefulness of the Speech Phase Spectrum for Pitch Extraction [PDF] [Copy] [Kimi1]

Authors: Erfan Loweimi ; Jon Barker ; Thomas Hain

Most frequency-domain techniques for pitch extraction, such as the cepstrum, the harmonic product spectrum (HPS) and summation residual harmonics (SRH), operate on the magnitude spectrum and turn it into a function in which the fundamental frequency emerges as the argmax. In this paper, we investigate the extension of these three techniques to the phase and group delay (GD) domains. Our extensions exploit the observation that the bin at which F(magnitude) becomes maximum, for some monotonically increasing function F, is equivalent to the bin at which F(phase) has the maximum negative slope and F(group delay) has the maximum value. To extract the pitch track from the speech phase spectrum, these techniques were coupled with the source-filter model in the phase domain that we proposed in earlier publications and with a novel voicing detection algorithm proposed here. The accuracy and robustness of the phase-based pitch extraction techniques are illustrated and compared with their magnitude-based counterparts using six pitch evaluation metrics. On average, we observe that the phase spectrum can be successfully employed in pitch tracking with accuracy and robustness comparable to the speech magnitude spectrum.
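
For reference, the classic magnitude-domain harmonic product spectrum can be sketched as below (the paper's contribution replaces the magnitude spectrum with phase- and group-delay-based functions); FFT size, harmonic count and search band are assumed values.

```python
import numpy as np

def hps_pitch(frame, fs, n_fft=4096, n_harmonics=5, fmin=60.0, fmax=400.0):
    """Harmonic product spectrum: multiply the magnitude spectrum with its
    downsampled copies so that F0 emerges as the argmax."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    hps = mag.copy()
    for h in range(2, n_harmonics + 1):
        hps[: len(mag[::h])] *= mag[::h]
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)
    return freqs[band][np.argmax(hps[band])]
```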

#22 Time-regularized Linear Prediction for Noise-robust Extraction of the Spectral Envelope of Speech [PDF] [Copy] [Kimi1]

Authors: Manu Airaksinen ; Lauri Juvela ; Okko Räsänen ; Paavo Alku

Feature extraction of speech signals is typically performed in short-time frames by assuming that the signal is stationary within each frame. For the extraction of the spectral envelope of speech, which conveys the formant frequencies produced by the resonances of the slowly varying vocal tract, a frame length of 20-30 ms is often used. However, this kind of conventional frame-based spectral analysis is oblivious to the broader temporal context of the signal and is prone to degradation by, for example, environmental noise. In this paper, we propose a new frame-based linear prediction (LP) analysis method that includes a regularization term penalizing energy differences in consecutive frames of an all-pole spectral envelope model. This integrates the slowly varying nature of the vocal tract into the analysis. Objective evaluations related to feature distortion and phonetic representational capability were performed by studying the properties of mel-frequency cepstral coefficient (MFCC) representations computed from different spectral estimation methods under noisy conditions using the TIMIT database. The results show that the proposed time-regularized LP approach exhibits superior MFCC distortion behavior while simultaneously having the greatest average separability of different phoneme categories in comparison to the other methods.

#23 Auditory Filterbank Learning Using ConvRBM for Infant Cry Classification [PDF] [Copy] [Kimi1]

Authors: Hardik B. Sailor ; Hemant Patil

Infant cry classification is a socially relevant problem in which the task is to classify normal vs. pathological cry signals. Since cry signals differ greatly from speech signals in their temporal and spectral content, there is a need for better feature representations for infant cry signals. In this paper, we propose to use unsupervised auditory filterbank learning with a Convolutional Restricted Boltzmann Machine (ConvRBM). Analysis of the subband filters shows that most of them are Fourier-like basis functions. The infant cry classification experiments were performed on two databases, namely, DA-IICT Cry and Baby Chillanto. The experimental results show that the proposed features perform better than standard Mel Frequency Cepstral Coefficients (MFCC) according to various statistically meaningful performance measures. In particular, the proposed ConvRBM-based features obtain absolute improvements of 2% and 0.58% in classification accuracy on the DA-IICT Cry and the Baby Chillanto databases, respectively.

#24 Effectiveness of Dynamic Features in INCA and Temporal Context-INCA [PDF] [Copy] [Kimi1]

Authors: Nirmesh Shah ; Hemant Patil

Non-parallel Voice Conversion (VC) has gained significant attention over the last decade. Obtaining corresponding speech frames from the source and target speakers before learning the mapping function is a key step in the standalone non-parallel VC task. Obtaining such corresponding pairs is challenging because the two speakers may have uttered different sentences, possibly in different languages. The Iterative combination of a Nearest Neighbor search step and a Conversion step Alignment (INCA) algorithm and its variant Temporal Context (TC)-INCA are popular unsupervised alignment algorithms. INCA and TC-INCA iteratively learn the mapping function after obtaining Nearest Neighbor (NN) aligned pairs between the intermediate converted and the target spectral features. In this paper, we propose to use dynamic features along with static features to calculate the NN aligned pairs in both the INCA and TC-INCA algorithms, since dynamic features are known to play a key role in differentiating major phonetic categories. We obtained average relative improvements of 13.75% and 5.39% with the proposed Dynamic INCA and Dynamic TC-INCA, respectively. This improvement is also positively reflected in the quality of the converted voices.
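
A small sketch of appending simple first-order delta (dynamic) features and using them in a nearest-neighbor alignment step; the two-frame delta formula and the k-d-tree search are generic stand-ins for the INCA NN step, not the paper's exact implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def add_deltas(feats):
    """Append first-order dynamic (delta) features to static spectral features."""
    padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")
    deltas = (padded[2:] - padded[:-2]) / 2.0
    return np.hstack([feats, deltas])

def nearest_neighbor_alignment(converted, target):
    """For every (intermediate) converted source frame, find the nearest target
    frame using static+delta features; the returned indices pair the frames."""
    tree = cKDTree(add_deltas(target))
    _, idx = tree.query(add_deltas(converted))
    return idx
```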

#25 Singing Voice Phoneme Segmentation by Hierarchically Inferring Syllable and Phoneme Onset Positions [PDF] [Copy] [Kimi1]

Authors: Rong Gong ; Xavier Serra

In this paper, we tackle the singing voice phoneme segmentation problem in the singing training scenario by using language-independent information - onset and prior coarse duration. We propose a two-step method. In the first step, we jointly calculate the syllable and phoneme onset detection functions (ODFs) using a convolutional neural network (CNN). In the second step, the syllable and phoneme boundaries and labels are inferred hierarchically by using a duration-informed hidden Markov model (HMM). To achieve the inference, we incorporate the a priori duration model as the transition probabilities and the ODFs as the emission probabilities of the HMM. The proposed method is designed in a language-independent way such that no phoneme class labels are used. For model training and algorithm evaluation, we collected a new jingju (also known as Beijing or Peking opera) solo singing voice dataset and manually annotated the boundaries and labels at the phrase, syllable and phoneme levels. The dataset is publicly available. The proposed method is compared with a baseline method based on hidden semi-Markov model (HSMM) forced alignment. The evaluation results show that the proposed method outperforms the baseline by a large margin on both the segmentation and onset detection tasks.